In this workshop, we will explore the advanced features of the Nextflow language and runtime, and learn how to use them to write efficient and scalable data-intensive workflows.
We will cover topics such as parallel Channels, Processes, and Operators.
Nextflow is based on the dataflow programming model in which processes communicate through channels.
Channel types Nextflow distinguishes between two different kinds of channels: queue channels and value channels.
Sending a message is an asynchronous (i.e. non-blocking) operation, which means the sender doesn’t have to wait for the receiving process.
Receiving a message is a synchronous (i.e. blocking) operation, which means the receiving process must wait until a message has arrived.
Queue channels are a type of channel in which data is consumed (used up) to make input for a process/operator. Queue channels can be created in two ways:
As the outputs of a process. Explicitly using channel factory methods such as Channel.of or Channel.fromPath.
When you want to create a channel containing multiple values you can use the channel factory Channel.of. Channel.of allows the creation of a queue channel with the values specified as arguments, separated by a ,.
You can use the Channel.fromList method to create a queue channel from a list object.
The fromPath factory method creates a queue channel containing one or more files matching a file path.
We have seen how to process files individually using fromPath. In Bioinformatics we often want to process files in pairs or larger groups, such as read pairs in sequencing.
For example:
Example:cat bin/example_channels.nf
#!/usr/bin/env nextflow
// Channel with explicit values
ch = Channel.of(1, 3, 5, 7)
ch.view { "value: $it" }
// Channel from a list
list = ['hello', 'world']
Channel.fromList(list).view()
// Channel from a text file
Channel.fromPath('./assets/random.txt').splitText().view()
// Channel from file pairs matching a pattern
Channel.fromFilePairs('./data/reads/*_{1,2}.fastq.gz').view()
nextflow run bin/example_channels.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_channels.nf` [festering_lattes] DSL2 - revision: a673b4b440
## value: 1
## value: 3
## value: 5
## value: 7
## hello
## world
## ENSG00000157764
##
## NM_001301717
##
## chr17:43044295-43170245
##
## RS123456
##
## gene123
##
## [SRR6357076, [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz]]
## [SRR6357071, [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz]]
## [SRR6357070, [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz]]
## [SRR6357072, [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]
We now know how to create and use Channels to send data around a workflow. We will now see how to run tasks within a workflow using processes.
A process is the way Nextflow executes commands you would run on the command line or custom scripts.
The syntax is defined as follows:
process < NAME > {
[ directives ]
input:
< process inputs >
output:
< process outputs >
when:
< condition >
[script|shell|exec]:
< user script to be executed >
}
For example:
Example:cat bin/example_process.nf
#!/usr/bin/env nextflow
params.input = "/home/sinrasu/sinrasu-test/*_{1,2}.fastq.gz"
process FASTQC {
tag "$meta.id"
cpus 2
memory 1.GB
conda "bioconda::fastqc=0.11.9"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
'biocontainers/fastqc:0.12.1--hdfd78af_0' }"
input:
tuple val(meta), path(reads)
output:
tuple val(meta), path("*.html"), emit: html
tuple val(meta), path("*.zip") , emit: zip
script:
"""
fastqc $reads
"""
}
workflow {
sample_ch = Channel.fromFilePairs(params.input, checkIfExists: true)
sample_ch.view()
FASTQC(sample_ch)
}
nextflow run bin/example_collect.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_collect.nf` [jolly_goldwasser] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]
In this chapter, we take a curated tour of the Nextflow operators. Commonly used and well understood operators are not covered here - only those that we’ve seen could use more attention or those where the usage could be more elaborate. These set of operators have been chosen to illustrate tangential concepts and Nextflow features.
The flatten operator transforms a channel in such a way that every tuple is flattened so that each entry is emitted as a sole element by the resulting channel.
Example:cat bin/example_flatten.nf
input_ch = Channel.from(["path/file1.fastq", "path/file2.fastq"], ["path/file3.fastq", "path/file4.fastq"], ["path/file5.fastq", "path/file6.fastq"])
input_ch
.flatten()
.view()
nextflow run bin/example_flatten.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_flatten.nf` [jovial_panini] DSL2 - revision: d714f8acfd
## path/file1.fastq
## path/file2.fastq
## path/file3.fastq
## path/file4.fastq
## path/file5.fastq
## path/file6.fastq
The collect operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.
Example:cat bin/example_collect.nf
workflow {
sample_ch = Channel.fromFilePairs("./data/reads/*_{1,2}.fastq.gz", checkIfExists:true)
.collect()
sample_ch.view()
}
nextflow run bin/example_collect.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_collect.nf` [condescending_bell] DSL2 - revision: 0eef816596
## ['SRR6357076', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357076_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357076_2.fastq.gz], 'SRR6357071', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357071_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357071_2.fastq.gz], 'SRR6357070', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357070_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357070_2.fastq.gz], 'SRR6357072', [/home/sinrasu/nextflow-workshop/data/reads/SRR6357072_1.fastq.gz, /home/sinrasu/nextflow-workshop/data/reads/SRR6357072_2.fastq.gz]]
The groupTuple operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.
Example:cat bin/example_groupTuple.nf
ch = channel
.of( ['wt','wt_1.fq'], ['wt','wt_2.fq'], ["mut",'mut_1.fq'], ['mut', 'mut_2.fq'] )
.groupTuple()
.view()
nextflow run bin/example_groupTuple.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_groupTuple.nf` [romantic_curie] DSL2 - revision: 14dfed2105
## [wt, [wt_1.fq, wt_2.fq]]
## [mut, [mut_1.fq, mut_2.fq]]
The branch operator allows you to forward the items emitted by a source channel to one or more output channels.
The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. For the first expression that evaluates to a true value, the item is bound to a named channel as the label identifier. For example:
cat bin/example_branch.nf
workflow {
params.input = "data/samplesheet.csv"
Channel.fromPath(params.input)
.splitCsv(header: true)
.map{ row -> [[sample:row.sample,strand:row.type],[row.fastq_1,row.fastq_2]]}
.branch{ meta, reads ->
single: meta.strand == "single"
paired: meta.strand == "paired"
}
.set { samples }
samples.single.view()
}
nextflow run bin/example_branch.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_branch.nf` [lethal_legentil] DSL2 - revision: 8b1f242980
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz]]
## [[sample:WT_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz]]
## [[sample:WT_REP2, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz]]
## [[sample:RAP1_IAA_30M_REP1, strand:single], [https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz]]
A common Nextflow pattern is for a simple samplesheet to be passed as primary input into a workflow. We’ll see some more complicated ways to manage these inputs later on in the workshop, but the splitCsv (docs) is an excellent tool to have in a pinch. This operator will parse a csv/tsv and return a channel where each item is a row in the csv/tsv:
cat bin/example_splitCsv.nf
workflow {
params.input = "https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/samplesheet/v3.10/samplesheet_test.csv"
Channel.fromPath(params.input)
.splitCsv(header: true)
.view()
}
Output:
nextflow run bin/example_splitCsv.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_splitCsv.nf` [stupefied_pare] DSL2 - revision: 5650611e54
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz, strandedness:auto]
## [sample:WT_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz, strandedness:auto]
## [sample:WT_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_UNINDUCED_REP2, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz, fastq_2:, strandedness:reverse]
## [sample:RAP1_IAA_30M_REP1, fastq_1:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz, fastq_2:https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz, strandedness:reverse]
The multiMap operator is a way of taking a single input channel and emitting into multiple channels for each input element.
Example:
cat bin/example_multiMap.nf
Channel.of( 1, 2, 3, 4, 5 )
.multiMap {
small: it
large: it * 10
}
.set { numbers }
numbers.small | view { num -> "Small: $num"}
numbers.large | view { num -> "Large: $num"}
nextflow run bin/example_multiMap.nf
## N E X T F L O W ~ version 23.04.1
## Launching `bin/example_multiMap.nf` [spontaneous_mirzakhani] DSL2 - revision: 3d406c0ad2
## Large: 10
## Large: 20
## Small: 1
## Small: 2
## Large: 30
## Small: 3
## Large: 40
## Large: 50
## Small: 4
## Small: 5
The exercies below are designed to strengthen your knowledge in Nextflow more. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Your Nextflow module should include the following:
Input parameters for specifying the input data (e.g., aligned BAM files).
script should contain any two functions from samtools.
sample
1 WT_REP1
2 WT_REP1
3 WT_REP2
4 RAP1_UNINDUCED_REP1
5 RAP1_UNINDUCED_REP2
6 RAP1_UNINDUCED_REP2
7 RAP1_IAA_30M_REP1
fastq_1
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_1.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_1.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_1.fastq.gz
4 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357073_1.fastq.gz
5 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357074_1.fastq.gz
6 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357075_1.fastq.gz
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_1.fastq.gz
fastq_2
1 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357070_2.fastq.gz
2 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357071_2.fastq.gz
3 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357072_2.fastq.gz
4
5
6
7 https://raw.githubusercontent.com/nf-core/test-datasets/rnaseq/testdata/GSE110004/SRR6357076_2.fastq.gz
strandedness
1 auto
2 auto
3 reverse
4 reverse
5 reverse
6 reverse
7 reverse